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Abstract 

This paper introduces the parallel hierarchical sampler (PHS), a 
Markov chain Monte Carlo algorithm using several chains simultane- 
ously. The connections between PHS and the parallel tempering (PT) 
algorithm are illustrated, convergence of PHS joint transition kernel 
is proved and and its practical advantages are emphasized. We il- 
lustrate the inferences obtained using PHS, parallel tempering and 
the Metropolis-Hastings algorithm for three Bayesian model selection 
problems, namely Gaussian clustering, the selection of covariates for 
a linear regression model and the selection of the structure of a treed 
survival model. 
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Introduction 



Let 9 £ be a random variable with distribution Markov chain Monte 

Carlo (MCMC) algorithms generate discrete-time Ma rkov chains ^9j}^L 1 



lavin g n(#) as their unique stationary distr ibution ([Robert and Ca sella 



1999]). MCMC methods wer e pioneered by 



and by 



Metropolis et al 



Metropolis and Ulam 



194 



19531 ] in the field of statistical mechanics. They 
have been adopted in statistics to approximate numerically expectations 
of the form Ejj(g(9)) where g(-) G L 2 (I1), i.e. the function g(-) is square 
integrable with respect to n(#). In Bayesian statistics, when the poste- 
rior distribution of a parameter 9 given the data X, U(9 \ X), cannot 
be integrated analytically with respect to its dom inating measur e, its rel- 



evant features can be approximated via MCMC (jTiernev 



fer to 


Gelfand and Smith 


1990 


1. 


Smith and Roberts 


1993 


LNeal 


Gilks et al. 


19951. Gamerman 


19971. Robert and Casella 


1999(1 a 


2001] among; others. 



19941 ]). For a 
may re - 



1993], 



Liu 



This paper illustrates a novel Markov chain algorithm, which we find 
useful for sampling from highly multimodal target distributions. We label 
this algorithm parallel hierarchical sampler (PHS) because of the prominent 
role of one chain with respect to the other generated chains. An impor- 
tant feature of PHS is that one array of Monte Carlo samples is gener- 
ated using many chains run in parallel. PHS has in fact many connec- 
tions with the Metropolis-c o upled Markoy chain Monte Carlo samplers of 



Swedensen and Wang 



1987 ] 



Gever 



199J and of 



Hukushima and Nemoto 



1996]. The main advantage of PHS with respect to other samplers is that 



the proposed updates are always accepted, thus ensuring optimal mixing of 
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the resulting chain. 

In Section 1 of this paper we review some foundations of MCMC meth- 
ods relevant for our work, with emphasis on the Metropolis-Hastings (MH) 
algorithm and on parallel tempering (PT). In Section 2 we introduce the 
PHS algorithm, we prove the reversibility of its joint transition kernel with 
respect to its target distribution and we illustrate the relationships between 
PT and PHS. Sections 3 and 4 illustrate two examples comparing the infer- 
ences obtained using the PHS algorithm with those of MH and PT within 
the Bayesian model selection framework. The first example deals with data 
clustering. In the second example we consider the problem of selecting the 
best subset of covariates for a Gaussian linear regression model. Section 5 
illustrates the application of PHS for deriving posterior inferences for the 
structure of a treed survival model. Section 6 discusses the current results 
and some ongoing developments of this work. 

1 Foundations of MCMC algorithms 

Let 6 be a random variable with distribution 11(6). In what follows, if f 6 is 
continuous we let f(6) be its probability density with respect to Lebesgue 
measure whereas if it is discrete f(6) is its probability mass function. When 
it is not possible to obtain independent draws from H(6), Markov chains 
can be used to generate dependent realisations {6i}fL 1 having stationary 
distribution H(6). Here the conditions under which such Markov chains can 
be constructed are stated and two algorithms are illustrated. 

Let K(6i,6i + \) be a transition kernel defining the probability to jump 
between any two values of 6. If there exist an integer d > such that the 
probability of a transition between any two values (6>j,#j + i) with f(6i) > 
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and > in d steps is positive, the transition kernel is Il-ir reducible. 

K(9i, 9i+i) is aperiodic if it does not entail cycles of transitions among states. 
A sufficient condition for aperiodicity of an irreducible transition kernel is 
that K(6i,6i) > for some 6i G 0. The transition kernel is reversible with 
respect to H(6) if it satisfies the detailed balance (DB) condition 



f(9 i )K(9 i ,e i+1 ) = f(0 i+1 )K(0 i+1 ,Oi). 



1 



If a reversible and fl-irreducible tr ansition kernel is also aperiodic, H(9) 



is its unique stationary distribution (|Nummelin 



19841 ] . 



Robert and Casella 



1999]). In such case, the strong law of large numbers holds for any function 
g(-) G L 2 (n), that is 



a.s. 



Further more, under these conditions also the central limit theorem holds so 



that ( Tiernev 



199J) 



N 



1=1 / 



where the asymptotic standa rd deviation o 9 k 



and on the transition kernel (IMira and Gever 



depe nds on the function g( 



1999]) 



1.1 The Metropolis-Hastings algorithm 



The Metropolis-Hastings algorithm ([Hastings 



1970 ] ) implements a family of 



transition kernels reversible with respect to an arbitrary target distribution 
n(#). Markov chains are generated by the MH algorithm by converting the 
independent draws from a proposal distribution q(-) in to dependent samples 



from n(#) through a simple accept /reject mechanism (jChib and Greenberg 
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1995], iBillera and Diaconicsl 200l[ |). Without loss of generality, in what 
follows we let q(-) assign zero probability to the current state 9{. Under this 
condition, the MH transition kernel can be written as 

K MH (9 h 6 i+ i) = a MH (0i, e i+1 )q(9 i+1 I 9i) if 9 i+1 + 9 h (2) 

where the acceptance ratio aMH^ii^i+i) is defined as 

a MH (9 i ,9 w ) = lA mq(di+il0i) ■ (3) 

The irreducibility and aperiodicity of the MH transition kernel and its 
practical performance for specific sampling problems hinge mainly on the 



appropriate choice of the proposal distribution q(-) (iTiernevi [1994I ]). Re- 
versibility with respect to f(9) can be immediately verified by checking that 
(1) holds using equations (2) and (3). Furthermore, by equation (3) the MH 
algorithm requires the evaluation of the function f(9) only up to a multi- 
plicative constant, which is a key feature for Bayesian applications where the 
target posterior distribution is ty pically known up to a finite multiplicative 



factor (Gelfand and Smith 



19901]). 



1.2 Parallel tempering 

Parallel tempering (PT) is a multiple-chains extension of the MH algorithm 
which can improve mixing when the MH sample paths exhibit high correla- 



tions rtGeyer 



19911 ] ) . Poor mixing of the MH algorithm typically arises when 



the ta r get distribution is multimodal. As e mphasized 



19871 ] . 



Hukushima and Nemoto 



by 



Swedensen and Wang 



1996] and iLiul 20011 ] . in statistical mechan- 



ics such distributions may arise in the analysis of stochastic interacting par- 
ticle systems, such as spin glasses and the Ising model. In Bayesian statistics 
multimodality of the posterior distribution might occur when an informative 



prior disagrees with the likelihood, in highly structured hierarchical models 
and when several nuisance parameters are integrated out of the joint poste- 
rior. An impotant example of the latter case is provided by Bayesian model 
selection methods using the marginal posterior probability of the model 
structure. 

The PT algorith m appears in the literature i n different forms and un- 



der different labels. 



Swedensen and Wang 



Monte Carlo" algorithm, 



Geyer 



19871 ] introduced the "Replica 



19911 ] lab eled his algorithm "Metr o polis- 



Hukushima and Nemoto 



19961 ] 



coupled MCMC", whereas the algorithm of 
is labeled "Exchange Monte Carlo". Here we give a unified description of 
PT encompassing those of Geyer, Liu and Hukushima and Nemoto. 

Let 1 = T\ < T2 < ... < Tm < 00 be a fixed real- valued vector of 
temperature levels. For each value of the index m £ [1, M] a heated version 
of the target density f(8) is defined by "powering up" f(8) as 



fm(8) 



(4) 



where C m > is a finite normalising constant depending on the temperature 
parameter T m . The latter acts as a smoother of the target distribution of 
the cold chain, which has temperature one, so that the heated densities 
have fatter tails and less pronounced modes with respect to fi{8). The PT 
sampler proceeds, at each iteration, by alternating an update step with a 
swap step. The former is carried out by updating each chain independently 
of the others typically via the Gibbs sampler using one or more embedded 
MH steps. To perform the swap step, let Sj have value if update is chosen 
at iteration i and 1 if swap is chosen instead. The proposal prob ability 



Geyer 



q e (Sj I Sj_i) describes how the two steps are combined by the sampler. 
19911 ] adopts the deterministic proposal q' s (si | Sj_i) = l{ s< _ 1= o}, whereas 



Liu 2001 ] defines an independent PT sampler using q s (si | Sj-i) = s where 
s £ (0, 1) is a fixed swap proposal rate. Let the indexes ji and ki range over 
(1, M) and let 9j l indicate the state of chain ji at iteration i. The second 
proposal employed by the PT sampler, q a (0^,0^) defines the probability 
that at iteration i a swap is att empted betw een the current values of the 



chain s with indexes (i i,kj). In 



Gever 



19911 ] . in 



Hukushima and Nemoto 



1996] and in lLiul 20011 ] this proposal is taken as independent of the current 
states of the two chains #f%#f J but only dependent on thei indexes (ji,ki). 
Specifically, the swap proposal used by these authors is uniform over all 
possible values of the ordered couple (ji,ki) with ki ^ ji and a swap is 
accepted with probability 



no3i nkil rnki nji]\ _ 1 A tli CVJ-M^') 



(5) 



ensuring the reversibility of the PT sampler with respect to its joint target 
distribution. When the independent updates of each chain are carried out 
using a single MH step, the joint transition kernel of the PT sampler is 

M 



Kp T (9M,i,0M,i+i) = (1 - q' s (si | [J q(9f +1 | 9f)aMH(0T,e 



+ 



w=l 

M M 



+q s (si I si-i) Yl T,^Ui,ki)a s {\er,e^],\e1',er]). (6) 

31=1 fcj=l 

where 0M,i = Of 1 ] is the state of all M chains at iteration i. From (6) 

it can be seen that when the within-chain updates produce high correlations, 
PT increases mixing for all chains through their successful swaps. Analo- 
gously to the MH algorithm, the irreducibility and aperiodicity of the PT 
transition kernel depend mainly on the proposal distribution for the within- 
chains update, q(-) and on that of the cross-chains swaps q s (•). A proof of the 
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revers ibility of the PT algorithm can be found in 



Hukushima and Nemoto 



1996]. Finally we note that, analogously to the MH algorithm, by equa- 



tions (4) and (5) the implementation of PT does not require knowledge of 
the finite normalising constants {C m }^f =1 so that it is a suitable MCMC 
sampler for Bayesian posterior simulation. 



2 The parallel hierarchical sampler 

A key difficulty affecting the general applicability of the PT sampler is its 
dependence on the values of the temperatures {T m }^ =1 . In statistical me- 
chanics, the latter are chosen with reference to the physical properties of 
the systems being modeled, such as the energy barriers implied by succes- 
sive temperature levels. However, in statistics the equilibrium distributions 
being simulated seldom possess analogous interpretations. An alternative 
solution illustrated in this Section is to employ a multiple chains sampler 
such that the equilibrium distributions of all chains is the same but the 
proposal distribution used to update each chain is different. Specifically, 
definition 1 describes a multiple-chains sampler which does not employ tem- 
peratures and combines independent updates with swap moves within each 
iteration. 

Definition 1 Let a multiple- chains MCMC sampler proceed by carrying out 
both the following two steps at each iteration: 

i ) let the index rrii be drawn from a discrete proposal distribution q ' s {mi \ 
mi—i) symmetric with respect to its arguments; 

ii ) swap the current value of chain rrii and that of the first chain; 
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Hi) update independently the remaining M — 2 chains each having the same 
marginal target distribution f(0). 

At point ii) above, we indicate as q ' s (■) the swap proposal to emphasize the 
analogy with the PT algorithm. We label the algorithm defined above par- 
allel hierarchical sampler (PHS) because the first chain is given a prominent 
role and the update of all chains is carried out in parallel analogously to PT. 
To provide a simple proof of the reversibility of the PHS joint kernel, in this 
Section we assume that the chains (2, ..M) are updated using a single MH 
step and that the t ransition kern els for these MH updates satisfy the condi- 
tions illustrated in 



Tiernev 



1994] so that they are irreducible and aperiodic 
with respect to their marginal target distributions. In addition, we assume 
that the symmetric proposal distribution q'g(-) allows for swaps between the 
first chain and any of the other chains. Under these conditions the marginal 
transition kernel for the first chain of the PHS algorithm is irreducible and 
aperiodic with respect to its target distribution. Let 9m = 9 x 9 x ... x 6 be 
the M-fold cartesian product of the random variable 9. By the arguments of 
Section 1, if the PHS joint transition kernel is also reversible with respect to 
the product density h{Qm) having all marginals equal to f(0), then h(9m) 
is the unique joint stationary distribution of the sampler. The reversibility 
of the PHS is proved in the following theorem. 

Theorem 1 The joint transition kernel of the PHS algorithm of Definition 
1 is reversible with respect to the joint distribution having product density 
or probability mass function (a(0m)- 

Proof The DB condition for the PHS algorithm is 



(7) 



9 



where KpHs{&M,i+i,0M,i) is the PHS joint transition kernel. When the 
independent updates of the chains (2, ...,M) are carried out via a MH step, 
the PHS joint transition kernel can be written explicitely as 

M M 

KpHs(OM,uOM,i+i) = E ^V* I II I 0i)aMH(ej,9j +1 ). (8) 

Each summand in (8) is the product of the marginal transition kernel for 
the swap transition and those of the (M — 2) independent MH updates for 
the remaining chains. The former coincides with the proposal q' s {mi | mj_i) 
because the PHS swap acceptance ratio is equal to one. This fact will be 
motivated in the next Section by illustrating the relationship between PHS 
and PT. Under (8) the DB condition (7) can be rewritten as 

M M 

£ glim, | mi-i) [J qiel +l | = 

mi =2 j=2 

M M 

= £ q'Um^ | mi) \{ qifil \ ^M^i- 9 l) ( 9 ) 

m,=2 j=2 

For any given value of rrii, by the reversibility of (2) and (3) with respect to 
/(#), the M — 2 MH transition probabilities on the left-hand side of (9) are 
equal to their corresponding terms on the right-hand side. By taking q'g(-) 
symmetric with respect to rrii and mj_i, for all values of rrii each summand 
on the left-hand side of (9) equals its corresponding term on the right-hand 
side, so that the equality (9) holds. o 

Equation (8) implies that, as for the MH and PT algorithms, PHS does 
not require knowledge of the normalising constant of its marginal target 
distributions C so that it is suitable for sampling from target distributions 
known only up to a finite multiplicative factor. 
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2.1 Relationship between PHS and parallel tempering 

Both (6) and (8) are mixtures of marginal transition kernels respectively 
denning the joint transition probabilities for the PT and PHS algorithms. 
The analogy between the two is that (M — 1) out of the M parallel chains 
are auxilliary and Monte Carlo estimates are computed using the samples 
of the first chain only. There are two important differences between the 
two samplers. At each iteration, the PHS transition kernel mixes over the 
update and swap steps as described in Definition 1 whereas in PT they are 
alternated according to the proposal probability q' s (si | Sj_i). Since the for- 
mer step typically generates local transitions whereas the latter produces 
larger jumps, PT creates unnecessary competition between local and global 
mixing. Furthermore, in PHS all marginal target distributions are not pow- 
ered up using a temperature coefficient as in PT. The rationales to avoid the 
temperature coefficients are both conceptual and practical. From a Bayesian 
prespective, the main conceptual issue is that the temperatures do not ap- 
pear neither in the likelihood function nor in the prior, so that it is not clear 
whether they should be treated analogously to the other parameters indexing 
the target posterior distribution. In practice, determining sensible values for 
the temperatures requires a lenghty trial-and-error process in the pursuit of 
a target swap rate between pairs of chains. Moreover, since the normalising 
constants of the marginal posterior distributions depend on their tempera- 
tures, updating of the latter is not possible unless for conjugate f amilies. The 



simulated tempering algorithm of 



Gever and Thompson 



1995] implements 



a single chain sampler for both the parameter 6 and a single temperature co- 
efficient T. The latter is treated as a discrete random variable and its update 
is carried out using a data dependent pseudo-prior in order to simplify the 
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normalising constant of the joint posterior distribution from the Metropolis- 
Hastings acceptance ratio. An interesting point in simulated tempering is 
that the temperature is not constrained to be larger than one, so that when 
T is close to zero the posterior distribution becomes concentrated around 
its modes. However, although practically useful, the simulated tempering 
algorithm does not clarify the nature of the temperature parameter and it 
does not explain how the posterior normalising constant could be seen as 
part of a prior distribution for the same parameter. 

In PHS, since all temperatures have value 1, the Metropolis swap ac- 
ceptance ratio (5) is equal to one, so that the proposed moves for the first 
chain are always accepted. This property marks the most evident difference 
between the sample paths of the first chain of PHS, those of the cold chain 
of PT and those of the MH algorithm. 

2.2 PHS as a variable augmentation scheme 



Varia ble augmentation for MCMC samplers was first introduced by 



Tanner and Wong 



1987]. Its general principle is that convergence of one of the generated 
chains can be sped up by cleverly augmenting the state-space using addi- 
tional coefficients. Conditionally on these auxilliary variables the posterior 
distribution of the parameters of interest can typically be sampled exactly. 
The PHS algorithm can be seen as a variable augmentation scheme where 
the additional coefficients are M — 1 replicates of the parameter of interest 
itself. In PHS the target distribution of the first chain does not depend on 
the other replicates. At each iteration, the M — 1 auxilliary chains having 
index (2, ..,M) directly provide a set of potential updates for the chain of 
interest. 
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2.3 PHS and multiple-try Metropolis algorithms 



Liuetal 



2000 ] illustrate a generalization of the single chain Metropolis 
algorithm where multiple values from the same proposal distribution are 
drawn at each iteration of the sampler. To attain detailed balance, a gen- 
eralized Metropolis acceptance ratio involving several pseudo-current chain 
states is computed. This multiple-try generalized Metropolis sampler ac- 
tually mimics the behaviour of a multiple chains algorithm using the sme 



proposal within each of t 
between the algorithm of 



re genera t ed ch ains. Therefore, the main analogy 



Liu et al 



2000| | and PHS is that many c andidates 



are av ailable at each iteration to update one chain of interest. In 



Liu et al. 



2000], only one of such updates is retained and the Metropolis ratio is modi- 
fied accordingly. In PHS, all such values not used for swapping with the first 
chain are retained and individually updated. Moreover, the proposal mech- 
anism generating all potential updates is not constrained to be the same for 
all chains. 



3 An illustrative example: MCMC generation of 
mixtures of Gaussian variates 

In this Section we report a comparison between the empirical performance 
of MH and PHS algorithms for generating a sample from a mixture of scalar 
Gaussian random variables. We use the results of one simulation to illustrate 
the typical difference between the performance of the two samplers. 

Within the MH algorithm we use a random walk uniform proposal dis- 
tribution on the interval (#j — 5, 6i + 5), where Q% is the current value of the 
chain. We construct a PHS algorithm using the same Metropolis updates 
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within chains (2, M) and by adopting a uniform swap proposal distribu- 
tion q" (mi | rrii-i) = For this example we let M = 10, that is we 
employ nine auxilliary chains having the same proposal spread 5 = 1 as that 
of the MH sampler. The PHS sampler was run for one hundred thousand 
iterations. In order to make the computational cost for both samplers com- 
parable, the MH algorithm was run for one million iterations. All chains 
were started at the same initial value equal to zero. 

The number of components of the mixture was set to 5, their means were 
generated uniformly at random over the interval (—10,10) obtaining the 
values (—8.85,-2.65,2.63,3.85,4.35). Their standard deviations were gen- 
erated uniformly at random over the interval (0.1, 1), obtaining the values 
(0.18,0.51,0.50,0.42,0.24). Finally, the unnormalised weights of the mix- 
ture components were drawn uniformly at random over the interval (1,5). 
Thier normalised values are (0.22,0.22,0.23,0.15,0.18). Figure 1 shows the 
probability density of the mixture over the range (—13.5, 6.6). The mixture 
components having means (2.63, 3.85) and standard deviations (0.50, 0.42) 
are very close and they do not result in two separate modes of the mixture 
density. Figure 2 compares the histograms of the MH draws with that of 
the first PHS chain. The former sampler effectively located the three closest 
modes to its starting value whereas PHS successfully visited all four modes 
of its target distribution. 
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Figure 1: the Gaussian mixture density used as target distribution for the Metropo- 
lis sampler and for the parallel hierarchical sampler. The distribution is a mixture 
of five Gaussian components having means (—8.85, —2.65, 2.63, 3.85, 4.35), standard 
deviations (0.18,0.51,0.50,0.42,0.24) and weights (0.22,0.22,0.23,0.15,0.18). 
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Figure 2: the plot on the left-hand side shows the histogram of the Metropolis 
draws. On the right-hand side, the plot represent the histogram of the draws of 
the first PHS chain. Thanks to the swapping mechanism, the latter successfully 
visited all the four modes of its marginal target distribution whereas the Metropolis 
algorithm only visited the three closes models to its starting value. 
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4 Application to the selection of covariates for the 
Bayesian linear regression model 

The covariates selection problem for the Bayesian Gaussian linear regression 



ras been addressed model usin g MCMC methods by 



1988], 



1995], 



19981 ] . 



Smith and Kohn 



1996] 



George and McCulloch 



Dellaportas et al. 



[20021 ] 



Mitchell and Beauchamp 



George and McCulloch 



19971 ] 



and 



Rafterv et al 



1993] 



199 



Clyde and Georgi 



Carlin and Chib 



Kuo and Mallick 



20041 ] among many 



others. 

Using the same notation as in 



George and McCulloch 



1997J, we let the 



distribution of the n-dimensional random vector Y be multivariate Gaus- 
sian with mean X~/3L and covariance matrix a 2 I n , being (a, (3, 7) a priori 
unknown. The p-dimensional model index 7 has elements 7^ taking value 
one if the jth covariate is used for the computation of the mean of Y and 
zero otherwise. Here /5L and include respectively the elements of the 
p-dimensional column vector (5 associated to non-zero components of 7 and 
the corresponding columns of X. The latter is a real-valued n x p matrix 
representing p potential predictors for the mean of Y. Within this frame- 
work, the variable selection problem consists of deriving inferences for 7 
conditionally on the data (Y, X). In order to compute such inferences, we 
employ MH, PT and PHS to generate draws from the marginal posterior 
probability of the model index, 

P( 1 \Y,X)^P{ 1 )P(Y \ 1 ,X). 

In this Se ction we adopt the sam e form of the marginal posterior probability 



of 7 as in 



Nott and Green 



20041 ] . letting 



P(7 I Y, X) oc (1 + n) 



■9(7) 
2 



Y Y 



n 



n + 



-Y X^(X^X^) 1 X^Y 



(10) 



17 



We also note that if the predictors included in tend to be collinear, the 
matrix X'^Xy can be almost singular and the right hand side of equation 
(11) may be numerically unstable. In such instances, we find that computing 
the marginal posterior using the Cholesky decomposition of X'^X^, as in 



Smith and Kohn 



1996? ], yields numerically stable results. 



In order to compare the PHS estimates with that of the MH and of 
the PT algorithms, we consider two simulated datasets. In both cases the 
dependent data is Y ~ iV(Xy/3 7 , 6.25/iso) and the regression coefficients are 
set at /3 7j . = 2j'/15 for j = 1, 15. For the first dataset, X is generated as a 
180 x 15 matrix of i.i.d. draws from a Normal distribution with mean zero 
and variance 1. Let Z\,...,Z\q be i.i.d Gaussian column vectors of length 
180 with mean zero and covariance matrix Iiso- F° r the second dataset a 
strong collinearity was induced among the predictors X by letting 



X 4 



Zj + 2Z W for j = 1, 15, 



where Xj is the jth column of X, as in Section 5.2.1 of 



George and McCulloch 



1997]. Here we will compare the estimation results of the MH algorithm 
with those of the cold chain of PT and of PHS for the two datasets using 
the estimated marginal posterior inclusion probabilities for each predictor 
and their Monte Carlo standard errors (MCSEs). The former are defined as 
SfcLi Tj/-^ where i = 1, ...,N is the i teration index and 7j is the ith 



7j 



draw for the jt h. predictor. As illustrat e d by 



2004] and by 



Geweke 



George and McCulloch 



19921 ] 



Nott and Green 



19971 ] . the MCSE for the inclusion 



probability of the jth predictor is 
MCSEfy 



\h\<N v 7 
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where Aj(h) is the lag h autocovariance of the chain of realisations for 7,-, For 
ergodic Markov chains, as N —* 00 the MCSE converges, up to an additive 
const ant independent of the transition kernel, to the MCMC standard error 



a a k (jMira and Gever 



1999]) where gipij) = E(^j \ Y,X) for this example. 
Three independent batches of chains were run for fifty thousand itera- 
tions. For PT and PHS, we used nine chains which target distributions are 
defined as in (4) with cold distribution (11). For PT, the heated chains were 
defined using the same array of equally spaced temperatures with range 1 — 5. 
For each sampler, the starting values of 7 for all chains was the null model. 
All within-chain updates were carried out using a component-wise random 
scan Metropolis algorithm proposing a change of the cur rent value of each 



parameter jj at every iteration, as in 



Denison et al 



19981 ] . The cross-chains 



proposals q s (-) and q s (•) were taken uniform. Since the PT algorithm also 
depends on the proposal q' s (-), we run three batches of PT chains using Liu's 
proposal q' s (si | Sj_i) = s with s = 0.2,0.5,0.8. 

Figure 5 illustrates the simulation results. The plots on the top row refer 
to the data without collinearity whereas the plots on the bottom report 
the inferences for the data with collinearity. By comparing the two rows 
of Figure 5, it appears that the induced collinearity among the predictors 
did not affect the estimation results for any of the samplers. The estimated 
inclusion probabilities are generally increasing with respect to the true value 
of their regression parameters /? 7 for all samplers and for both datasets, with 
a noticeable shift occurring between the fifth and the sixth predictors (which 
regression parameters are respectively /3 75 = 0.67 and /3 76 = 0.8). Higher 
swap rates for the PT algorithm, marked by circles and by plus signs in 
Figure 5, result in a large decrease in the estimated inclusion probabilities 
for the predictors corresponding to large regression coefficients /3L. The 
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estimated inclusion probabilities for the predictors associated to low values 
of the regression coefficients are the lowest for the MH algorithm whereas 
the PHS estimates are the highest for these predictors. Finally, the plots on 
the right-hand side of Figure 5 suggest that for this example the precision 
of the three samplers, as measured by their MCSEs, is roughly comparable. 



5 Application to the estimation of the structure 
of a survival CART model 

In regression and classification trees (CART) the sample is clustered in dis- 
joint sets called leaves. The leaves are the final nodes of a single-rooted 
binary partition of the covariates space which we will refer to as the tree 
structure. Within each leaf, the response variable is modeled according 
to the regression, or c lassification or with the survival analysis frameworks 



(Breiman et al 



with the papers of 



19841 ). Bayesian CART model s appeared in t he lit erature 



Chipman et al 



19981 ] and 



Denison et al. 



19981 ] . The 



MCMC model search algorithms developed in these two papers treat the 
tree structure as an unknown parameter and explore its marginal poste- 
rior distribution using the Gibbs sampler and the MH algorithm. In this 
example we focus on tree models for randomly right-censored survival data 



Gordon and Olshen 



1992a 



19951 



Davis and Anderson 



M.Leblanch and J.Crowlev 



tree model has been proposed by 



19891 ] 



M.Leblanch and J.Crowlev 



1992 bjl). The first 



Pittman et al 



Bayesian survival 



20041 ] . who adopted a 



Weibull leaf sampling density and a step-wise greedy model search algo- 
rithm based on the evaluation of all the possible splits within each node. 
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Figure 3: estimated marginal posterior inclusion probabilities for the fifteen pre- 
dictors (left) and their Markov chain standard errors (right). The plots on top 
refer to the simulated data without collinearity and the bottom plots refer to the 
data with collinearity. In all plots, plus signs correspond to the results of the PT 
algorithm with s = 0.8, circles correspond to PT with s — 0.5 and asterisks mark 
the PT results with s = 0.2. Triangles identify the results of the PHS algorithm 
and dots mark those of MH. The PHS estimates for the marginal inclusion proba- 
bilities are the highest for the predictors associated to low values of their regression 
coefficients /3 7 , whereas high swap rates for the PT algorithm produce wrong esti- 
mates of the inclusion probabilities for the predictors associated to high values of 
(3-y. The precision of the three algorithms, as measured by their MCSEs, appears 
to be comparable. 
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The main strength of this model is that it incorporates a flexible parametric 
form for the survival function and that the tree search quickly converges to a 
mode in the model space. In this Section, we propose a fully Bayesian anal- 
ysis of the marginal posterior distribution over the space of tree structures 
using PHS under the Weibull leaf likelihood. Upon convergence, besides 
providing the structure of the estimated modal tree, the realisations of the 
cold chain provide a sample from the marginal posterior distribution of the 
tree structure which can be employed to rank the combinations of the co- 
variates defining the visited trees as a function of their estimated marginal 
posterior inclusion probabilities. 

5.1 Tree structure marginal posterior distribution 

Let the survival times {tj}" =1 be independent random variables condition- 
ally on the tree structure (6, £) and on the Weibull leaf parameters (a^, /3^). 
Under this assumption, the joint sampling density of the survival times can 
be written as 

b n f 6 \ lfe i 

at i x, 5, b, c, a c , p c ) = n n ( 3 e ~^ k ) > ( n ) 

k=ij=i ^ ' 

where b is the number of leaves, 5j takes value 1 for exact observations and 
for right censored observations and lk,j = l{x,-eCk} * s ^ ^ ^ ne covar i a t e 
profile of the jth sample unit is included in which is the subset of the 
covariate space corresponding to leaf k = 1, 6, and otherwise. Under a 
discrete uniform prior for the tree structure, the marginal posterior proba- 
bility P(b, C | t, X, 5) can be obtained, up to a multiplicative constant, by 
integrating (11) with respect to the conditional prior distribution for the 
array of leaf parameters (a^,f3^). In this work we place inde pendent un- 



informative priors for each Weibull leaf parameter. 
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Sun 



19971 ] provides an 



accurate study of different uninformative priors for the Weibull distribution. 
In particular, Sun's paper indicates that h^aj., (3^) = l/a^Pk is the Jeffreys' 
prior and the first order matching prior for the Weibull parameters of leaf k 
given the parametrization (11). Sun also shows that, under hk(ak, /?&), the 
joint posterior density for the leaf parameters (afc,/3fc) is proper when the 
number of data points is larger than one and if all the observations falling 
in leaf k are not equal. For this specification of the prior structure, the joint 
posterior of the tree structure and of the leaf parameters can be written as 

f(b,(,a c ,P c \t,X,8)<xl[—-l[ ( (afc&tf" 1 ) V^M '. (12) 

k=l akPk j=l V J 

The Weibull scale parameters can be integrated out of the joint pos- 
terior (12) analytically. The resulting integrated posterior density is 

f(b, C, a c \t,X,5)<xT\ ^ J 3 — k ^-n . (13) 

Under (13), analytical integration of the Weibull index p arameters ak is 



not po ssible. For any given model, the Monte Carlo method of 



Chib and Jeliazkov 



20011 ] can be employed to compute a simulation-based approximation of the 
marginal posterior probability of the tree structure. However, since this in- 
tegration needs to be performed for each visited tree, the computational 
cost of Chib and Jeliazkov's method makes it unsuitable for any iterative 
model search. In this work we approximate the tree structure marginal pos- 
terior probability using the Laplace expansion of equation (13), which can 
be written as 

F(6,< | t.X.S) « exp + ± (log(/(«) - MzMl**k)) , (14 ) 

where % is the posterior mode of the Weibull leaf log index parameter 
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r/k = log(afe). The derivation of equation (14) and the explicit forms of the 
functions l(n) and hiv) are reported in the Appendix. 



5.2 Marginal posterior inference for the tree structure 



In the CART framework, as in the clustering problem illustrated in Section 3, 
the main challenge for constructing efficient within-chain proposal distribu- 
tions is the lack of a distance metric between different models. This issue has 
been also noted by 



MCMC algorithm ( 



Brooks et a 



Green 



2003] in the context of the reversible jump 



1995j |). Our specificati on of the within-chain pro 



posal distribution gen eralizes the approaches of 



Denison et al 



Chipman et al 



19981 ] and 



1998] by devising two additional within-chain transitions 



besides their insert, delete and change moves. For the within-chain updates 
we propose a transition at random among the following five types: 

1) Insert: sample a leaf at random and insert a new split by randomly 
selecting a new splitting rule. 

2) Delete: sample at random a leaf pair with common parent and at most 
one child split and delete it. 

3) Change: resample at random one splitting rule. 

4) Permute: sample a random number of splits and permute at random 
their splitting rules. 

5) Graft: sample at random one of the tree branches and graft it to one 
of the leaves of a different branch. 



Chipman et al 



1998] noted that their MCMC algorithm can effectively re- 



sample the splitting rules of nodes close to the tree leaves but the rules 
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defining splits close to the tree root are seldom replaced. In our specifi- 
cation of the within-chain transitions, move number 4 aims at improving 
sampling of the splitting rules at all levels of the tree structure. Further- 
more, the fifth move type allows the sampler to jump to a tree structure 
distinct from the current one without changing its splitting rules. 

Having adopted a multiple-chains algorithm, we also devised two types 
of cross-chains transitions. The first is the cross-chains version of the insert, 
graft and change transitions, swapping the elements of the tree structure 
required to perform corresponding pairs of transitions across chains. The 
second class of cross-chains transitions includes a whole tree swap between 
chains. 

At iteration i, the PHS algorithm for this example proceeds as follows: 

1) choose at random one of the heated chains mj £ [2, M] and propose 
at random one of the cross-chains moves, accepting the swap with 
probability 1. 

2) update each of the remaining M — 2 chains independently using the five 
types of within-chain transitions and the MH acceptance probability. 

5.3 Analysis of a set of cancer survival times 

Colorectal adenocarcinoma ranks second as a cause of death due to cancer in 
the western world and l iver metastasi s is th e main cause of death in patients 



with colorectal cancer (Pasetto et al. 



20031 ] ). The survival times of 622 pa- 



tients with liver metastases from a colorectal primary tumor were collected 
along with their clinical profiles by the International Association Against 
Cancer (http://www.uicc.org). Table 1 reports a description of the nine 
available clinical covariates. The survival times of this dataset are currently 
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included in the R library locfit (http : //ww w.locfit.info). This data has 



be en analyzed by 



by 



Hermanek and Gall 



Antoniadis et al 



1990 ] using non-parametric methods, 



1999] using their wavelet-based method for esti mating 



the su rvival density and the instantaneous hazard function and by 



Kottas 



2003], who employed a Dirichlet process mixture of Weibull distributions 



to derive a Bayesian non-parametric estimate of t he survival density and of 



the hazard function. 



Haupt and Mansmann 



1995] employed this dataset to 



illustrate the non-parametric tree fitting techniques for survival data imple- 
mented in the S-plus function survcart. The aim of this Section is show- 
ing that the estimates of (b, £) obtained using the PHS algorithm and the 
approximate marginal posterior (14) provide meaningful inferences for the 
prognostic significance of the available covariates. For each covariate, the 
latter will be represented by its estimated posterior inclusion probability, 
i.e. by the proportion of sampled models which structure depends on the 
covariate. A PHS using twenty parallel chains was run for fifty thousand 
iterations, the starting tree for each chain being the root model. Consis- 
tently with the PHS algorithm described in Section 5.2, for this analysis we 
used a uniform swap proposal distribution q s (-)- On the top row, Figure 6 
shows the unnormalised log posterior tree probability for the models visited 
by the cold chain, plotted respectively versus the iteration index and versus 
their number of leaves. The posterior sampling for the cold chain moved 
quickly towards areas of high marginal posterior probability models, which 
leaf range is 10 — 14, the best tree having 12 leaves. The bottom plot of 
Figure 6 shows the estimated marginal posterior inclusion probabilities for 
all the covariates. According to these estimates, the covariates with maxi- 
mal prognostic significance are the diameter of the largest liver metastasis 
and the number of liver metastases, followed by their locoregional disease 
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Name 


Symbol 


Description 


1 


Diam. largest LM 


DLM 


(l,20)mm 


2 


Age 


AGE 


(18,88)years 


3 


Diagnosis of LM 


TD 


synchrone/metachron with CPT 


4 


Gender 


SEX 


M = 55.8%, F = 44.2% 


5 


Lobar involvement 


LI 


unilobar /bilobar 


6 


Number of LM 


NLM 


(1,20+) 


7 


Locoregional disease 


LRD 


yes/no 


8 


Metastatic stage 


TNM 


local/regional/distant 


9 


Location PT 


LOC 


colon/rectum 



Table 1: description of the covariates for the liver dataset. The data include several 
types of clinical covariates, such as continuous (DLM), discrete (AGE, NLM) and 
categorical (all others). 
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status, the patients'age at diagnosis, the localisation of their primary tumor 
and their lobar involvement status. The estimated inclusion probabilities of 
the remaining covariates suggest that, for this sample, their values do not 
discriminate among significantly different survival clusters. 

Figure 7 shows the structure of the estimated modal posterior tree. The 
depth of the leaves in this figure reflects the number of splits required to 
generate them. For any given tree structure (b, £), posterior inferences for 
(a^,/^) can be obtained by further simulation using their full conditional 
posterior distributions, which can be easily derived from the full conditional 
posterior (12). In order to maintain the focus of this Section on the appli- 
cation of PHS for tree selection, here we adopt the Kaplan- Meier (KM) sur- 
vival curves as non-parametric estimates of the survival probabilities within 
each leaf. Table 2 reports the number of patients clustered within each leaf 
of the modal tree and the values of their KM survival probabilities at 12, 
24 and 36 months. The bold figures in Table 2 correspond respectively to 
the highest and to the lowest estimated survival probabilities at the three 
time points. The lowest survival correspond to leaves number 1 and 2 for 
all the three time points. These two groups are defined by high values of 
the first two covariates in the tree, which are the diameter of the largest 
liver metastasis and the number of liver metastases. The highest estimated 
survival probabilities at 12, 24 and 36 months correspond to leaf number 
8 which is characterised by at most one liver metastasis of small diameter, 
local spreading of the cancer and no locoregional disease. 
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Figure 4: the top plots show the unnormalised log marginal posterior probability 
of the tree structure for the models visited along the PHS posterior simulation by 
the cold chain. The horizontal axis in the left plot represents the iteration index, 
whereas in the right plot it represents the number of leaves of the corresponding 
tree. The plot on the bottom shows the estimated marginal posterior inclusion 
probabilities for the nine covariates. Those with maximal prognostic significance 
are the diameter of the largest liver metastasis, the number of liver metastases, the 
locoregional disease status, the patients'age at diagnosis, the localisation of their 
primary tumor and their lobar involvement status. 
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4 


78 


0.85 


0.50 


0.28 


5 


42 


0.85 


0.50 


0.16 


6 


48 


0.80 


0.57 


0.26 


7 


31 


0.90 


0.81 


0.64 


8 


42 


1.00 


0.84 


0.77 


9 


30 


0.69 


0.25 


0.19 


10 


32 


0.65 


0.30 


0.10 


11 


42 


0.68 


0.59 


0.29 


12 


31 


0.97 


0.76 


0.29 



Table 2: number of observations falling in each leaf of the estimated posterior 
modal tree and Kaplan-Meier survival probabilities at 12, 24 and 36 months. The 
highest survival probabilities correspond to leaf number 8, which is defined by a low 
number of local metastases of small diameter and by the absence of locoregional 
disease. The lowest survival probabilities correspond to leaves 1 and 2, which are 
characterized respectively by metastases of large diameter and by a large number 
of smaller metastases. 
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Figure 5: tree structure of the estimated modal tree found by the PHS simula- 
tion. The lowest estimated survival probabilities at 1, 2 and 3 years correspond to 
leaves number 1 and 2, which are defined by high values of the two key covariates 
(NLM, DLM), whereas the best estimated survival corresponds to leaf number 8, 
which is defined by the non-linear interaction of (DLM,NLM,LRD and LOC). 
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6 Discussion 



This paper presents a novel multiple chains algorithm for Markov chain 
Monte Carlo inference, which we labeled parallel hierarchical sampler. We 
emphasized the main terms of comparison to evaluate the parallel hierarchi- 
cal sampler, that are t he M etropolis algorithm, the muliple-try Metropolis 
algorithm of ] 



Liu et al 



2000f | and parallel tempering. Whilst in the Metropo- 



lis sampler high a cceptance rates usually produce high autocorrelations and 



slow convergence ({Roberts et al. 



1997b)]), by construction PHS produces a 



chain which always moves but which exhibits low serial dependence. As 
it can be seen be comparing the PT and PHS joint transition kernels, re- 
ported by equations (6) and (8), the first advantage of PHS with respect to 
PT is that its swap proposal has a simpler form. This is because in PHS 
both types of transitions are performed at each iterations instead of being 
sampled according to the distribution q' s (si | Sj_i). The second practical 
advantage of PHS with respect to PT is that its implementation does not 
require choosing a set of temperature values. When little is known about 
the equilibrium distribution being studied, we regard this feature of PHS as 
a potentially major advantage. Furthermore, although the algorithm pre- 
sented in Section 2 allows for a different proposal distribution within each of 
the chains indexed m = 2, ...,M, this is not even a necessary requirements 
for the implementatio n of PHS. 



As pointed out by 



Geyer 



19911 ] , the attractive feature of multiple-chains 
MCMC samplers such as PT and PHS is that their target distribution factors 
into the product of the marginal distributions for each chain despite the fact 
that these chains are made dependent by the swap transitions. Under the 
conditions of Section 2 we prove in Theorem 1 that the samples generated 
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by the PHS algorithm converge weakly to such product distribution. 

In Section 2 we also noted that the complexity of multiple-chains transi- 
tion kernels, which for PT and PHS are mixtures of their marginal transition 
kernels, largely prevents a direct analytical comparison of their convergence 
properties. Direct comparison of the transition kernels (6) and (8) leads to 
major analytical difficulties and so far it has not been possible to e stablish 
an ordering between the two ker nels using t 



re criteria illustrated in 



1973], 



Meyn and Tweedie 



199J] and 



Mira 



Peskun 



200l|]. Although it falls beyond 



the scope of this paper, we regard the development of computable ordering 
criteria for mixtures of Markov chain transition kernels as a key area which 
deserves furt her investigation. 



Following 



Huelsenbeck et al. 



20011 ] , the last three Sections of this paper 



emphasise the relevance of multiple-chains MCMC algorithms for estimating 
the posterior model probabilities in a variety of settings. In Sections 3 and 
4 we provide two examples comparing numerically the inferences obtained 
by the Metropolis-Hastings algorithm with those of PT and PHS. In these 
two Sections we focussed on comparing the posterior inferences without 
considering the computational time required to produce them. The rationale 
behind this choice is that the computational time required by multiple-chains 
samplers is highly dependent on the available computational resources. For 
instance, if all chains used by PT and PHS are run in parallel on several 
processors, their run time may be comparable to that of the single-chain 
Metropolis-Hastings sampler, whereas if all chains are updated sequentially 
using one processor their run time is, of course, much longer. 

In Section 3 we observed that for the Gaussian clustering the PHS algo- 
rithm appears to explore the space of cluster configurations more effectively 
with respect to MH and PT using the same within-chain proposal mecha- 
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nism for all samplers. For the example of Section 4 we observed that using 
high swap proposal rates for the PT sampler leads to wrong estimates of 
the marginal posterior inclusion probabilities with and without collinearity 
among the predictors X. However, by measuring the precision of the three 
MCMC algorithms by their Markov chain standard errors we did not find 
a significant advantage of the multiple-chains samplers with respect to the 
MH algorithm. 

Section 5 illustrates the application of PHS for deriving inferences for 
the structure of a treed survival model. One of the main differences be- 
tween of Sections 4 and 5 is that the focus of the former is the selection of 
the relevant main regression effects whereas in the latter the key elements 
defining different survival groups are the non-linear interactions among the 
predictors defining the tree structure. The top-right plot in Figure 6 shows 
that the approximated marginal posterior probability of the tree structure 
under the Weibull model does not increase monotonically with the number 
of leaves. Under the Weibull model we used the Laplace expansion to ap- 
proximate the t ree structure marginal posterior probability. The Schwarz 



approximation (jSchwarz 



197a ]) was also considered. The penalty term of 
the Schwarz approximation increases with the model dimension, thus it rep- 
resents a cost for complexity factor. However, given a fixed number of leaves 
this approximation favours trees allocating the data more unevenly across 
leaves. Therefore, employing the Schwarz approximation when many covari- 
ates are available might result in assigning significant posterior probability 
to large and unbalanced trees, leading to overfitting small groups of survival 
data. On the other hand, the penalty associated to the Laplace approxima- 
tion has a complex form involving the tree size, the log Weibull parameters 
rji and the survival times along with their censoring indicators. Evalua- 
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tion of this approximation for a variety of tree structures showed that this 
penalty is strictly increasing with the tree dimension but it does not favour 
unbalanced trees. Using the Laplace expansion to approximate the model's 
marginal posterior probability and the PHS algorithm to sample from it, we 
find meaningful posterior inferences for a set of colorectal cancer survival 
data. The estimated modal tree separates the short-term survivors, who 
are characterised by a large number of liver metastases of large size, from 
the long-term survivors, who present a few local metastases of small size 
without further symptoms. 

Finally, in Sections 3, 4 and 5 we addressed qualitatively the issue of 
convergence of the chains produced by the three MCMC algorithms by con- 
sidering their acceptance rates and the fluctuations of their marginal pos- 
terior probabilities. Being the model spaces inherently non-metric, it was 
not possible to use the state-dependent criteria commonly used to assess the 



convergence of Mar 



Carlin and Cowles 



toy chains to their stationary distr ibutions illust rated in 



19961 ]. 



Roberts et al 



1997al or in 



Robert 



19981 ] among 



others. Although it is beyond the scope of this paper, in light of the increas- 
ing relevance of model selection problems we consider the development of 
appropriate convergence measures an important field for future research. 



Appendix 

The Laplace approximation is the second order Taylor expansion of the 
logarithm of the integrated posterior (14) around its posterior mode. In 
order to derive the approximation, it is convenient to parametrize equation 
(14) as a function of rjk = log(a^), so that the variables to be integrated out 
have support on the real line. Under this parametrization, stable estimates 
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of the posterior modes {fjk\k=i can be computed numerically. For each leaf 
the log integrated conditional posterior is 

l( m ) oc log(r(^<5 j l feJ )) + (r /jfc -l)^,5 j l feJ + e^^^log(t J )l fe , j 
j j j 

- £<^iog(£tf i feJ ). 

The penalty arising from the Laplace approximation is proportional to 
minus the logarithm of the second derivative of the log integrated posterior 
taken with respect to the leaf parameters The second derivative of the 

function l(rik) is 

him) = e Vk l^Sjlog^lkj J - J ' . 

Summing the approximation over the b leaves yields the right-hand side of 
equation (15). 
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